Abstract

Given two measurable spaces H and D with countably generated σ-algebras, a
perfect prior probability measure P_H on H and a sampling distribution S: H → D,
there is a corresponding inference map I: D → H which is unique up to a set of
measure zero. Thus, given a data measurement μ: 1 → D, a posterior probability
P̂_H = I ∘ μ can be computed. This procedure is iterative: with each updated
probability P_H, we obtain a new joint distribution which in turn yields a new inference
map I and the process repeats with each additional measurement. The main
result uses an existence theorem for regular conditional probabilities by Faden, which
holds in more generality than the setting of Polish spaces. This less stringent setting
then allows for non-trivial decision rules (Eilenberg-Moore algebras) on finite (as
well as non-finite) spaces, and also provides a common framework for decision
theory and Bayesian probability.
1 Introduction

Bayesian probability is a subject that has proven very successful in prediction,
inference and model selection [3, 10, 11]. Cencov [20] gives a categorical foundation for
non-Bayesian statistical inference, but as far as the authors are aware, a categorical framework
for Bayesian probability has not been fully developed. Lawvere took the first steps in
this direction by defining the category 𝒫 of probabilistic mappings in the unpublished
manuscript [14]. Following this, Lawvere and Huber [15] gave a seminar in Zurich on
Bayesian Sections, further developing 𝒫 as a basis for Bayesian probability. The first
appearance in the literature was an expansion of these ideas by Giry [9], who showed
that the endofunctor T: Meas → Meas associated to the probability adjunction given
by Lawvere forms a monad, and that 𝒫 is the Kleisli category of that monad.

Subsequently, Meng [16] examined the category of convex sets and affine linear maps,
which can be shown to be equivalent to the category of Eilenberg-Moore algebras of the
Giry monad. This category can be thought of as the category of "decision rules" since the
objects of that category are certain measurable functions TX → X whose fibers partition
the space of probability measures on a given space X into positive convex measurable sets.¹
Based on the work by Giry restricting the monad to Polish spaces, Doberkat [5, 6] has since
characterized the Eilenberg-Moore T-algebras for these topological spaces. That work,
however, was based upon giving the space of probability measures the σ-algebra generated
by the weak topology (as used for the Polish space monad), which results in (nontrivial)
finite spaces having no T-algebras. In the final section, we show that this negative result
can be circumvented by avoiding topological conditions and using the initial σ-algebra
generated by the evaluation maps. Others, including Wendt [21], van Breugel [19] and
Abramsky-Blute-Panangaden [1] have also studied similar constructions.

In this paper, we show that Bayesian probability theory can be given a categorical
foundation using the category 𝒫 if we only assume that conditioning is based on
points, measurable spaces have countably generated σ-algebras and we restrict to perfect
measures. The main theorem (Theorem 4.1) states that inference maps are uniquely
determined by a prior probability and a sampling distribution. This result follows from the
existence of regular conditional probabilities, and we restate an existence theorem
(Theorem 3.1) of Faden [8], which relies on perfect measures on countably generated measurable
spaces instead of the more restrictive Polish spaces. Using our characterization of Bayesian
probability in 𝒫 and the fact that the category 𝒫 embeds into the category of T-algebras,
we can then see that the category of decision rules provides a common framework for both
decision theory and Bayesian probability. Some of these ideas are similar in spirit to the
general notion of distributions based on commutative monads found in the recent paper
of Kock [12].

The authors would like to express gratitude to F.W. Lawvere, P.F. Stiller and T. 
Nguyen for many fruitful conversations regarding these ideas. This work was partially 
supported by AFOSR, for which the authors are extremely grateful. 

2 The Category of Probabilistic Mappings 

We begin with an overview of the category of probabilistic mappings and recall several
fundamental results that are widely known. Let us fix the following notation. We will
denote the σ-algebra of a measurable space X by Σ_X, and the category of measurable
spaces and measurable functions by Meas. For an object (X, Σ_X) in Meas we will often
drop the associated σ-algebra from the notation and denote it simply by X when the
σ-algebra is obvious or inconsequential. We will use (1, Σ_1) and (2, Σ_2) for the one-element
and two-element measurable sets with the discrete σ-algebras, but we will similarly just
write "1" or "2" when these are used as objects in some category.

¹ Doberkat [5] refers to this process of making decision rules as derandomization, which in certain
applications like probabilistic semantics may be a more appropriate terminology.

The category of probabilistic mappings 𝒫 has measurable spaces (X, Σ_X) as objects,
and an arrow f: (X, Σ_X) → (Y, Σ_Y) between two such objects consists of a function
f: X × Σ_Y → [0, 1] such that

(i) for all B ∈ Σ_Y, the function f(·, B): X → [0, 1] is measurable,

(ii) for all x ∈ X, the function f(x, ·): Σ_Y → [0, 1] is a probability measure on Y.

That is, morphisms in 𝒫 can be thought of as parametrized families of probability
measures that vary measurably. For an arrow f: (X, Σ_X) → (Y, Σ_Y) we will often denote the
function f(·, B): X → [0, 1] by f_B and the function f(x, ·): Σ_Y → [0, 1] by f_x.
Given two arrows

(X, \Sigma_X) \xrightarrow{f} (Y, \Sigma_Y) \xrightarrow{g} (Z, \Sigma_Z)    (1)

the composition g ∘ f: X × Σ_Z → [0, 1] is defined by

(g \circ f)(x, C) = \int_{y \in Y} g_C(y) \, df_x.    (2)

The associativity of this composition follows easily from the monotone convergence
theorem. An important fact is that every measurable function f: X → Y may be regarded
as a 𝒫-morphism δ_f: X → Y, where the Dirac (or one point) measure

\delta_f(x, B) = \begin{cases} 1 & \text{if } f(x) \in B \\ 0 & \text{if } f(x) \notin B \end{cases}    (3)

assigns to each x ∈ X the Dirac measure on Y which is concentrated at f(x). Taking the
measurable function f to be the identity map on a particular measurable space X gives
the arrow δ_{Id_X}: (X, Σ_X) → (X, Σ_X), i.e., the identity arrow for X in 𝒫. In fact, it is
easy to check that the association f ↦ δ_f determines a functor δ: Meas → 𝒫 taking a
measurable space to itself. Note that this functor is not faithful, however, and so we do
not get an embedding of Meas into 𝒫. We will call a 𝒫 arrow P: X → Y deterministic
if for every B ∈ Σ_Y the measurable functions P_B: X → [0, 1] assume only the values 0 or
1. The following proposition characterizes deterministic arrows in 𝒫.
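
As an illustration only, the following Python sketch specializes these definitions to finite
spaces with discrete σ-algebras. The encoding is an assumption made for the example and is
not taken from the text: a space is range(n), an arrow is a row-stochastic matrix k with
k[x][y] = k(x, {y}), and the helper names compose, dirac, is_deterministic and
underlying_function are hypothetical. Under this encoding the integral in Equation (2)
reduces to a finite sum, i.e., to matrix multiplication.

```python
from fractions import Fraction as F

# Assumed encoding (illustration only): a finite space is range(n) with the
# discrete sigma-algebra, and an arrow X -> Y is a row-stochastic matrix k
# with k[x][y] = k(x, {y}).

def compose(g, f):
    """Return g o f: the finite form of (g o f)(x, C) = int g_C(y) df_x."""
    return [[sum(f_x[y] * g[y][z] for y in range(len(g)))
             for z in range(len(g[0]))]
            for f_x in f]

def dirac(func, n_x, n_y):
    """Kernel delta_func induced by a function func: range(n_x) -> range(n_y)."""
    return [[F(int(func(x) == y)) for y in range(n_y)] for x in range(n_x)]

def is_deterministic(k):
    """True when every entry k(x, {y}) is 0 or 1."""
    return all(v in (0, 1) for row in k for v in row)

def underlying_function(k):
    """For a deterministic kernel k, recover the function f with k = delta_f."""
    return [row.index(1) for row in k]

f = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]                 # X -> Y
g = [[F(1), F(0), F(0)], [F(1, 3), F(1, 3), F(1, 3)]]        # Y -> Z
h = [[F(1, 5), F(4, 5)], [F(1), F(0)], [F(1, 2), F(1, 2)]]   # Z -> X
gf = compose(g, f)                                           # X -> Z
swap = dirac(lambda x: 1 - x, 2, 2)                          # a deterministic arrow
```

In this finite setting the rows of gf again sum to one, composition is associative, and
underlying_function(swap) recovers the function that swaps the two points.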

Proposition 2.1. Let (Y, Σ_Y) be a countably generated space. A 𝒫 arrow P: X → Y is
deterministic if and only if it is determined by a measurable function f with P = δ_f.

Proof. If P = δ_f for some measurable function f, then it clearly only assumes a value of
either 0 or 1 and so is deterministic. Conversely, let 𝒢 be a countable generating set for
Σ_Y. Then

\mathcal{G}' = \{ G \mid G \in \mathcal{G} \text{ or } G^c \in \mathcal{G} \} \cup \{ Y, \emptyset \}    (4)

is also a countable generating set. Suppose for each B ∈ 𝒢' that P_B assumes a value of
either 0 or 1 on X. We claim that there exists a measurable function f on X, such that
for each B ∈ Σ_Y the diagram

[Diagram: triangle X --f--> Y --χ_B--> [0, 1] with third side P_B: X → [0, 1]]

commutes, where χ_B is the characteristic function.
For each x ∈ X let

\mathcal{G}_x = \{ B \in \mathcal{G}' \mid P_B(x) = 1 \} \quad \text{and} \quad B_x = \bigcap_{B \in \mathcal{G}_x} B.    (5)

Now we need to see that B_x ≠ ∅. Clearly 𝒢_x is not empty since P_x is a probability measure
on Y and so P_Y(x) = 1. Moreover, if B_x = ∅, then

Y = B_x^c = \bigcup_{B \in \mathcal{G}_x} B^c,    (6)

but the additivity of P_x and the fact that P_x(B^c) = 0 for all B ∈ 𝒢_x show that this gives
a contradiction. Thus, we can choose any set function f: X → Y such that f(x) ∈ B_x
for all x ∈ X, and then the condition P_B = χ_B ∘ f holds for all B ∈ 𝒢'. It remains to
prove that f is measurable, and it is enough to check this on 𝒢'. For any B ∈ 𝒢',
we have f^{-1}(B) = f^{-1}(χ_B^{-1}({1})) = P_B^{-1}({1}) ∈ Σ_X since P_B is measurable. Since 𝒢' is a
generating set for the σ-algebra, we can extend this to all of Σ_Y and P = δ_f. □

The following lemma gives two useful properties which follow easily from standard
exercises in measure theory and the definition of composition in 𝒫.

Lemma 2.2. The composite

X \xrightarrow{\delta_p} Y \xrightarrow{f} Z

satisfies (f ∘ δ_p)(x, U) = f_{p(x)}(U) for all x ∈ X and all U ∈ Σ_Z, while the composite

X \xrightarrow{f} Y \xrightarrow{\delta_p} Z

satisfies (δ_p ∘ f)(x, U) = f_x(p^{-1}U) for all x ∈ X and all U ∈ Σ_Z.
There are several distinguished objects in 𝒫 that play an important role in many
constructions. Any set X with the indiscrete σ-algebra Σ_X = {X, ∅} is a terminal object since
any arrow P: Y → X is completely determined by the fact that P_y must be a probability
measure on X. We denote the canonical terminal object by 1 since it is isomorphic to the
one-element set. Notice that an arrow P: 1 → X is precisely a probability measure on X.

Proposition 2.3. The terminal object is a separator for 𝒫.

Proof. Given any two parallel arrows f, g: X → Y such that the composites

1 \xrightarrow{P} X \overset{f}{\underset{g}{\rightrightarrows}} Y

are equal for all probability measures P on X, it follows for each x ∈ X that f ∘ δ_x = g ∘ δ_x.
Thus f_x = g_x for all x ∈ X and we have that f = g. □

In addition to having a separator, the category 𝒫 also has a coseparator. Using
notation from classical logic, let 2 = {⊤, ⊥} and Σ_2 be the discrete σ-algebra on 2.

Lemma 2.4. A 𝒫-arrow P: X → 2 is equivalent to a measurable function f: X → [0, 1].

Proof. Because P_x is a probability measure on 2, we have P(x, 2) = 1 and P(x, ∅) = 0
for all x ∈ X. Moreover, we have P(x, {⊥}) = 1 − P(x, {⊤}) and so P is completely
determined by P_{⊤}: X → [0, 1]. Since there are no additional restrictions on P_{⊤}, we
get a bijection hom_𝒫(X, 2) ≅ hom_Meas(X, [0, 1]). □

As a consequence of Lemma 2.4, a 𝒫-morphism with codomain 2 is sometimes denoted
by f̄, when there is a need to distinguish the 𝒫-morphism and the measurable function
f with codomain [0, 1]. Now we are able to prove the following useful proposition.

Proposition 2.5. The object 2 is a coseparator for 𝒫.

Proof. Given two parallel arrows f, g: X → Y, if f ≠ g then there exist an x ∈ X and
a B ∈ Σ_Y such that f(x, B) ≠ g(x, B). The arrow χ_B determined by the characteristic
function on B coseparates f and g. □

Finally, we briefly show how the Giry monad factors through 𝒫. Let 𝓟X denote the
set of probability measures on X, endowed with the coarsest σ-algebra such that the
evaluation maps ev_B: 𝓟X → [0, 1] given by ev_B(P) = P(B) are measurable. Then we
can define a functor 𝓟: 𝒫 → Meas which sends a measurable space X to the space 𝓟X of
probability measures on X. On arrows, 𝓟 sends the 𝒫-arrow f: X → Y to the measurable
function 𝓟f: 𝓟X → 𝓟Y defined pointwise on Σ_Y by

\mathscr{P}f(P)(B) = \int_X f_B \, dP.    (8)
That is, 𝓟f(P) gives the probability measure on Y defined by the composition

1 \xrightarrow{P} X \xrightarrow{f} Y    (9)

in 𝒫. Since 𝓟X = hom_𝒫(1, X) as sets, another common notation for 𝓟X is X^1, but we
will use the functor notation for clarity.

In fact, 𝓟X is an important object in 𝒫 as well, and based on the definition of the
σ-algebra on 𝓟X, we can define the evaluation morphism ε_X: 𝓟X → X by ε_X(P, A) = P(A).
With this, we are able to characterize the relationship between Meas and 𝒫, first proved
in [9].

Theorem 2.6. The functors δ: Meas → 𝒫 and 𝓟: 𝒫 → Meas form an adjunction

\delta \dashv \mathscr{P}, \qquad \delta: Meas \rightleftarrows \mathcal{P} :\mathscr{P}

with the unit of the adjunction given by η_X(x) = δ_x and the counit ε_X: 𝓟X → X.

Thus, we can realize the Giry monad as the composition T = 𝓟 ∘ δ, and moreover, 𝒫
is equivalent to the smallest category through which T factors, i.e., it is equivalent to the
Kleisli category K(T) of the Giry monad. Hence every 𝒫 arrow P: X → Y corresponds
uniquely to a measurable function X → TY.



3 Joint Distributions and Conditionals 

Given a family of objects {X_i}_{i∈I} we can form the cartesian product ∏_{i∈I} X_i and endow
this set with the product σ-algebra generated by all the projection maps π_j: ∏_{i∈I} X_i → X_j,
one for each index j ∈ I. It is easy to see that

\left( \prod_{i \in I} X_i, \bigotimes_{i \in I} \Sigma_{X_i} \right) \xrightarrow{\delta_{\pi_j}} (X_j, \Sigma_{X_j})    (10)

does not give a categorical product. In fact, only weak products and equalizers exist in 𝒫,
as the uniqueness condition fails for both constructions. We use the terminology "product
space" to denote the set product of any family {(X_i, Σ_{X_i})}_{i∈I} of objects with the product
σ-algebra and not to imply that the object (∏_{i∈I} X_i, ⊗_{i∈I} Σ_{X_i}) with projections satisfies
any universality condition. We will call a probability measure J: 1 → (∏_{i∈I} X_i, ⊗_{i∈I} Σ_{X_i})
on a product space a joint distribution. We do not mean to imply that these are
distributions of a random variable, but rather indicate a measure on a product space which is not
necessarily a product measure. These joint distributions are the main objects of study in
Bayesian probability.
Given any joint distribution J: 1 → ∏_{i∈I} X_i, for each j ∈ I we have the diagram

1 \xrightarrow{J} \prod_{i \in I} X_i \xrightarrow{\delta_{\pi_j}} X_j    (11)

where the composite δ_{π_j} ∘ J is called the marginal (distribution) of J on component X_j
and is given by (δ_{π_j} ∘ J)(A_j) = J(π_j^{-1} A_j) by Lemma 2.2.

Given only the probability measures on the components, {P_j}_{j∈I}, there are many
joint distributions on the product space whose marginals are the given family {P_j}_{j∈I}.
By using relationships in the form of conditionals between the components, we bring
into play additional knowledge that permits the determination of the appropriate joint
distribution. If the uncertainty of component X_j, as expressed by a probability measure
P_j on component X_j, depends conditionally on a parameter which varies over component
X_i, then we have the 𝒫 arrow h: X_i → X_j. These conditionals, which are the morphisms
in 𝒫, are the key to determining a unique joint distribution. The relationship between
the components X_i and X_j is mediated by the conditional h and expresses the relationship
P_j = h ∘ P_i.

3.1 Constructing a Joint Distribution Given Conditionals 

We now show how marginals and conditionals can be used to determine joint distributions
in 𝒫. This development follows that of [1] where we first learned of this approach (the
category Stoch of stochastic kernels in that paper is precisely what we call 𝒫). Given a
conditional probability measure h: X → Y and a probability measure P_X: 1 → X on X,
consider the diagram

[Diagram (12): 1 --P_X--> X, 1 --J_h--> X × Y and X --h--> Y, together with the projections δ_{π_X}: X × Y → X and δ_{π_Y}: X × Y → Y]    (12)

where J_h is the uniquely determined joint distribution on the product space X × Y defined
on the rectangles of the σ-algebra Σ_X ⊗ Σ_Y by

J_h(A \times B) = \int_A h_B \, dP_X.    (13)

The marginal of J_h with respect to Y then satisfies δ_{π_Y} ∘ J_h = h ∘ P_X and the marginal of J_h
with respect to X is P_X. By a symmetric argument, if we are given a probability measure
P_Y and conditional probability k: Y → X then we obtain a unique joint distribution J_k
on the product space X × Y given on the rectangles by

J_k(A \times B) = \int_B k_A \, dP_Y.    (14)

However, if we are given P_X, P_Y, h, k as indicated in the diagram

[Diagram (15): 1 --P_X--> X, 1 --P_Y--> Y, X --h--> Y and Y --k--> X]    (15)

then we have that J_h = J_k if and only if the compatibility conditions

P_X = k \circ P_Y, \qquad P_Y = h \circ P_X    (16)

are satisfied. Thus if the compatibility conditions are satisfied, then we can realize the
product rule of probability in 𝒫 as

\int_A h_B \, dP_X = J(A \times B) = \int_B k_A \, dP_Y.    (17)

In the extreme case, suppose we have a conditional h: X → Y which factors through the
terminal object 1 as

X \xrightarrow{!} 1 \xrightarrow{Q} Y, \qquad h = Q \circ !    (18)

where ! represents the unique arrow X → 1. If we are also given a probability
measure P: 1 → X, then we can calculate the joint distribution determined by P and
h = Q ∘ ! as

J(A \times B) = \int_A (Q \circ !)_B \, dP = P(A) \cdot Q(B)    (19)

so that J = P ⊗ Q. This is precisely the situation where we say that the marginals P
and Q are independent. Thus in 𝒫 independence corresponds to a special instance of a
conditional: one that factors through the terminal object.
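
For finite spaces the construction of this subsection can be checked by direct computation.
The sketch below is an illustration under assumptions of our own: spaces are encoded as
range(n) with discrete σ-algebras, a distribution is a list of weights, a conditional is a
row-stochastic matrix, and the function names joint_from_conditional, marginals and push are
hypothetical. On singletons, Equation (13) becomes J({x} × {y}) = h(x, {y}) P_X({x}).

```python
from fractions import Fraction as F

def joint_from_conditional(p_x, h):
    """Finite form of Equation (13): J({x} x {y}) = h(x, {y}) * P_X({x})."""
    return [[p_x[x] * h[x][y] for y in range(len(h[0]))]
            for x in range(len(p_x))]

def marginals(j):
    """Return the X-marginal and the Y-marginal of a joint table j[x][y]."""
    return [sum(row) for row in j], [sum(col) for col in zip(*j)]

def push(p, k):
    """Composite k o p of a distribution p with a conditional k."""
    return [sum(p[x] * k[x][y] for x in range(len(p)))
            for y in range(len(k[0]))]

p_x = [F(1, 3), F(2, 3)]                        # a prior on X (made-up numbers)
h = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]    # a conditional X -> Y
j = joint_from_conditional(p_x, h)
m_x, m_y = marginals(j)

# Independence: h factors through 1, so every row is the same measure q.
q = [F(1, 5), F(4, 5)]
j_indep = joint_from_conditional(p_x, [q, q])
```

Here m_x agrees with p_x and m_y agrees with push(p_x, h), matching the two marginal
statements above, and j_indep is the product table P_X({x}) Q({y}) of Equation (19).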



3.2 Constructing Regular Conditionals Given a Joint Distribution

The following result is the basis from which the inference maps in Bayesian probability
theory are subsequently constructed. Many textbooks restrict to Polish spaces in order to
prove the existence of regular conditional probabilities (e.g., see [7]), though this is not a
necessary condition. Several more general characterizations of conditions which guarantee
the existence of regular conditional probabilities have been found, either restricting the
spaces involved or the joint distributions which are allowed. In [17], Pachl does not require
even countably generated σ-algebras, but relies instead on a certain notion of compactness.
We will prefer to follow [8], where Faden gives a necessary and sufficient condition when
we restrict to countably generated spaces. Namely, the marginals of a joint distribution
must give perfect measure spaces.² The class of perfect measures is broad and includes,
for example, all Radon measures. The proof of the following theorem can be found in [8],
where several equivalent conditions are identified.

Theorem 3.1. Let (X, Σ_X, P_X) be a perfect countably generated probability space and
(Y, Σ_Y) a measurable space. If J is a joint distribution on X × Y with marginal P_X =
δ_{π_X} ∘ J on X, then there exists a 𝒫 arrow f that makes the diagram

[Diagram (20): triangle formed by J: 1 → X × Y, the marginal of J, and the arrow f into X × Y]    (20)

commute and satisfies

\int_{A \times B} (\delta_{\pi_Y})_C \, dJ = \int_C f_{A \times B} \, dP_Y.    (21)

Moreover, the morphism f is the unique 𝒫-morphism with these properties, up to a set of
P_Y-measure zero.

Interestingly, we can use Theorem 3.1 to obtain a seemingly stronger statement, i.e.,
that the regular conditional probability factors through the product. Though this is not
difficult to prove, we will prefer this stronger statement in the sequel.



² A measure space (X, Σ, μ) is called perfect if for any measurable function f: X → ℝ, there exists a
Borel set E ⊆ f(X) such that μ(f^{-1}(E)) = μ(X).






Theorem 3.2. Let X and Y be countably generated measurable spaces and J a joint
distribution on X × Y with marginal distributions P_X and P_Y on X and Y such that
(X, Σ_X, P_X) and (Y, Σ_Y, P_Y) are perfect probability spaces. Then there exist 𝒫 arrows f
and g such that the diagram

[Diagram (22): 1 --P_X--> X, 1 --J--> X × Y and 1 --P_Y--> Y, with g: X → X × Y, f: Y → X × Y and the composites δ_{π_Y} ∘ g: X → Y and δ_{π_X} ∘ f: Y → X]    (22)

commutes and

\int_A (\delta_{\pi_Y} \circ g)_B \, dP_X = J(A \times B) = \int_B (\delta_{\pi_X} \circ f)_A \, dP_Y.    (23)

Proof. Since (X, Σ_X, P_X) and (Y, Σ_Y, P_Y) are countably generated and perfect, so also
is (X × Y, Σ_X ⊗ Σ_Y, J). Countable generation is obvious, while perfection follows from
Theorem VIII of Ryll-Nardzewski in [18]. Now we can apply Theorem 3.1 to see that
there is a 𝒫-arrow f: Y → X × Y satisfying J = f ∘ P_Y such that

\int_C f_{A \times B} \, dP_Y = J(A \times (B \cap C)).    (24)

Then from Lemma 2.2 we know that (δ_{π_X} ∘ f)(y, A) = f(y, A × Y) and so

\int_B (\delta_{\pi_X} \circ f)_A \, dP_Y = \int_B f_{A \times Y} \, dP_Y    (25)
    = J(A \times (Y \cap B))    (26)
    = J(A \times B).    (27)

Similarly we obtain a 𝒫-arrow g: X → X × Y satisfying J = g ∘ P_X and

\int_A (\delta_{\pi_Y} \circ g)_B \, dP_X = J(A \times B).    (28)

With these facts, it is a simple exercise to check that the diagram commutes. □

Note that if the joint distribution J is obtained by a probability measure P_X and a
conditional h: X → Y using the method described by Diagram 12, then using the above
result and notation it follows P_X-a.s. that h = δ_{π_Y} ∘ g.
Remark 3.3. (Tonelli's Theorem) Let X and Y be countably generated measurable
spaces. Given the joint distribution J on X × Y with perfect marginals P_X and P_Y,
let γ: X → X × Y and φ: Y → X × Y be the 𝒫 arrows satisfying γ ∘ P_X = J and
φ ∘ P_Y = J whose existence is guaranteed by Theorem 3.2. Given any measurable function
F: X × Y → [0, 1] we have the diagram

[Diagram (29): the diagram of Theorem 3.2 with arrows γ: X → X × Y and φ: Y → X × Y, extended by F and the composites f and g]    (29)

where the top two triangles commute. Thus we can define f = F ∘ γ and g = F ∘ φ so
that the entire diagram commutes. From this, it follows that

\int_X f \, dP_X = \int_{X \times Y} F \, dJ = \int_Y g \, dP_Y    (30)

and we can realize Tonelli's Theorem as the special case with J = P_X ⊗ P_Y.

This formulation provides the context for the following optimal transportation
problem: given marginals P_X and P_Y that model the supply and demand constraints, and a
cost function F (defined up to a scalar constant) representing the unit cost to transport
a product from x ∈ X to y ∈ Y, what joint distribution J on X × Y with marginals
P_X and P_Y minimizes the objective function ∫_{X×Y} F dJ? The optimal assignment is then
the conditional probability X → Y determined by the optimal joint distribution J. For
example, this problem is investigated and a unique solution is given in certain cases in 



4 Bayesian Probability in 𝒫

If we replace X and Y in Diagram 22 by H(ypotheses) and D(ata), and the composites
δ_{π_Y} ∘ g by S(ampling distribution) and δ_{π_X} ∘ f by I(nference), then we can define P_D =
S ∘ P_H to obtain

[Diagram: 1 --P_H--> H, 1 --J--> H × D and 1 --P_D--> D, with S: H → D and I: D → H]

In the context of Bayesian probability the probability measure P_H is often called a prior
probability.

In this notation, the product rule in 𝒫 given in Equation 17 becomes

\int_A S_B \, dP_H = J(A \times B) = \int_B I_A \, dP_D

where A ∈ Σ_H and B ∈ Σ_D. We will spend the remainder of this section showing how
this interpretation of these spaces in 𝒫 provides a categorical foundation for Bayesian
probability. First, we briefly review the fundamental concepts in Bayesian probability
theory which can be found in [11] and then proceed to show how 𝒫 is the appropriate
category for this theory. Generally, a Bayesian model is comprised of a number of items
including

(i) two measurable spaces H and D representing hypotheses and data, respectively,

(ii) a probability measure P_H on the H space called the prior probability,

(iii) a 𝒫 arrow S: H → D called the sampling distribution,

(iv) a 𝒫 arrow I: D → H called the inference map.

Note that the data space D can also be thought of as the event space for some experiment,
and the σ-algebra on D is determined by distinguishable data. The prior probability
measures are updated via the inference map as one takes measurements, which correspond
to probability measures μ on D. These updated probability measures are then called
posterior probabilities and are given by P̂_H = I ∘ μ. The posterior P̂_H then becomes the
prior probability for the next step and the process continues as more measurements are
taken. At each step in a Bayesian process, the posterior probability is a representation of
knowledge about the hypotheses based on all of the data that has been accumulated up
to that point.

Using the sampling distribution S and the prior probability P_H on H, we can define
a joint distribution J on the product space H × D as in Section 3.1 by defining it on the
rectangles as J(A × B) = ∫_A S_B dP_H. The D-marginal (prior probability on data) is then
P_D = S ∘ P_H = δ_{π_D} ∘ J. Using Theorem 3.2 we have the following theorem.

Theorem 4.1. Given a prior probability P_H and a sampling distribution S, the inference
map I: D → H is determined uniquely up to sets of P_D-measure zero.

Proof. Let I: D → H be the composition δ_{π_H} ∘ f, where δ_{π_H} is the projection H × D → H
and f: D → H × D is the 𝒫-arrow satisfying

J = f \circ P_D    (32)

\int_C f_{A \times B} \, dP_D = J(A \times (B \cap C))    (33)

whose existence is given by Theorem 3.2. Thus

\int_B I_A \, dP_D = J(A \times B) = \int_A S_B \, dP_H    (34)

and this inference arrow I is unique in that if I' also satisfies Equation 34 then the set
{y ∈ D | I_y ≠ I'_y} has P_D-measure zero. □
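
On finite spaces the inference map of Theorem 4.1 is given by the familiar Bayes formula,
which makes the statement easy to test numerically. The following sketch is illustrative
only and rests on assumptions that are not part of the text: H and D are encoded as
range(n) with discrete σ-algebras, the numbers are made up, and the names inference_map and
posterior are hypothetical. At data points of P_D-measure zero the inference map is not
determined, so the sketch makes an arbitrary choice there.

```python
from fractions import Fraction as F

def inference_map(prior, s):
    """Bayes inversion on finite spaces: I(d, {h}) = S(h, {d}) P_H({h}) / P_D({d})."""
    n_h, n_d = len(prior), len(s[0])
    p_d = [sum(prior[h] * s[h][d] for h in range(n_h)) for d in range(n_d)]
    inf = []
    for d in range(n_d):
        if p_d[d] == 0:
            # P_D-null data point: any distribution is admissible here;
            # this sketch arbitrarily reuses the prior.
            inf.append(list(prior))
        else:
            inf.append([s[h][d] * prior[h] / p_d[d] for h in range(n_h)])
    return inf, p_d

def posterior(inf, mu):
    """Posterior I o mu for a measurement mu, given as a distribution on D."""
    return [sum(mu[d] * inf[d][h] for d in range(len(mu)))
            for h in range(len(inf[0]))]

prior = [F(1, 2), F(1, 2)]                        # P_H (made-up numbers)
s = [[F(9, 10), F(1, 10)], [F(1, 5), F(4, 5)]]    # sampling distribution S: H -> D
inf, p_d = inference_map(prior, s)
exact = posterior(inf, [F(1), F(0)])              # measurement mu = Dirac measure at d = 0
```

With these numbers p_d is [11/20, 9/20] and exact is [9/11, 2/11]. The product rule
S(h, {d}) P_H({h}) = I(d, {h}) P_D({d}) holds entrywise, and composing inf with p_d returns
the prior.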

Thus the complete process works in the following way. A prior probability P_H and
sampling distribution S are specified, from which one determines the inference map I.
Once measurements μ: 1 → D are taken, we then calculate the posterior probability by
I ∘ μ. This updating procedure can be characterized by the diagram

[Diagram (35): 1 --P_H--> H (solid), H --S--> D (solid), D --I--> H (dotted), 1 --μ--> D (dashed) and 1 --I ∘ μ--> H (dashed)]    (35)

where the solid lines indicate arrows given a priori, the dotted line indicates the arrow
determined using Theorem 3.2, and the dashed lines indicate the updating after a
measurement. Note that if there is no uncertainty in the measurement, then μ = δ_x for some
x ∈ D, but in practice there is usually some uncertainty in the measurements themselves.

Following the calculation of the posterior probability, the sampling distribution is 
then updated, if required. The process can then repeat: using the posterior probability 
and the updated sampling distribution the updated joint probability distribution on the 
product space is determined and the corresponding (updated) inference map determined. 
We can then continue to iterate as long as new measurements are received. For some 
problems (such as with the standard urn problem with replacement of balls) the sampling 
distribution does not change from iterate to iterate, but the inference map is updated 
since the posterior probability on the hypothesis space changes with each measurement. 
The model selection problem (either once at the beginning of this process, or iteratively 
throughout) can also be modeled as a meta-Bayesian process, where the hypothesis space 
is the space of potential models and the data constitutes some information that would 
inform on the suitability of a given model. 
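
The iteration described above can be sketched for an urn problem with replacement, where the
sampling distribution stays fixed and only the prior changes. The model below is a
hypothetical illustration: the three candidate urns, their red fractions, the sequence of
draws and the function name update are all assumptions made for the example, and every
measurement is taken to be exact (a Dirac measure on D).

```python
from fractions import Fraction as F

def update(prior, s, d):
    """One Bayesian step for an exact observation d (mu = Dirac measure at d)."""
    weights = [prior[h] * s[h][d] for h in range(len(prior))]
    total = sum(weights)            # P_D({d}); assumed nonzero in this sketch
    return [w / total for w in weights]

# Hypothetical urn model: three candidate urns with different red fractions.
red = [F(1, 4), F(1, 2), F(3, 4)]
s = [[r, 1 - r] for r in red]       # s[h][0] = P(red | h), s[h][1] = P(blue | h)
prior = [F(1, 3)] * 3

draws = [0, 0, 1, 0]                # red, red, blue, red
p = prior
for d in draws:
    p = update(p, s, d)             # the posterior becomes the next prior
```

After the four draws p equals [3/46, 8/23, 27/46], the same result as a single update using
the likelihood of the whole sequence.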

Remark 4.2. We know from Theorem 4.1 that the inference map I is uniquely
determined by P_H and S up to a set of P_D-measure zero. However, there is no reason a priori
that a measurement μ: 1 → D is required to be absolutely continuous with respect to P_D.
If μ is not absolutely continuous with respect to P_D, then a different choice of inference
map I' could yield a different posterior probability, i.e., we could have I ∘ μ ≠ I' ∘ μ.
Thus we make the assumption that measurement probabilities on D are absolutely
continuous with respect to the prior probability P_D on D. This is a reasonable assumption,
however, since if a data event is impossible (has P_D-measure zero) under a certain model,
then the model should not be expected to make a meaningful inference when presented
with that data. On the other hand, it is easy to see that if a measurement μ ≪ P_D, then
I ∘ μ ≪ P_H, as expected.
We emphasize that this procedure can be employed for any perfect prior probability
and any regular conditional probability. For example, given a perfect prior P: 1 → 𝓟X
and the conditional ε_X: 𝓟X → X there corresponds a unique inference map I: X → 𝓟X
satisfying, for all A ∈ Σ_X and for all B ∈ Σ_{𝓟X},

\int_B (\varepsilon_X)_A \, dP = \int_A I_B \, d(\varepsilon_X \circ P).    (36)

In the case where X = 2 = {⊤, ⊥} (the two element set), these "higher order
distributions" P: 1 → 𝓟2 can be used to explicate the concept of A_p distributions as
characterized by Jaynes [11, Chapter 18]. Using our notation, a proposition A, which is a
morphism A: 1 → 2 in the category Set of sets, has an associated probability of truth,
say Pr(A) = p. Hence A determines a 𝒫-morphism A: 1 → 2, with A({⊤}) = p. The
information supplied by the arrow A consists only of the single value p and fails to
indicate how sensitive this proposition is to additional data. The confidence that one has
in the value p can be supplied by the higher order distributions which are probability
measures on the space 𝓟2 of probability measures on 2. Since 𝓟2 consists precisely of the
Bernoulli distributions B_θ = θ δ_⊤ + (1 − θ) δ_⊥, where θ ∈ [0, 1], it follows that 𝓟2 ≅ [0, 1].
Consequently, any distribution

1 \xrightarrow{A_p} \mathscr{P}2    (37)

has an expected value which can be calculated using the composition

1 \xrightarrow{A_p} \mathscr{P}2 \xrightarrow{\varepsilon_2} 2.    (38)

Thus E(A_p) = (ε_2 ∘ A_p)({⊤}) = p and any such distribution provides a more informative
measure. For example, the two distributions on 𝓟2 ≅ [0, 1] specified by p = δ_{1/2} and
p' the uniform (Lebesgue) measure both have expected value 1/2. Yet clearly, the first
is deterministic, expressing a (complete) confidence in the statement that the expected
value of the proposition

1 \xrightarrow{A} 2    (39)

where A({⊤}) = 1/2 is 1/2. On the other hand, the distribution p' also determines A, but
instead expresses a maximal ignorance modeled by the uniform distribution.

5 The Category of Decision Rules 

Recall that the Giry monad T factors through 𝒫 via the adjunction of Theorem 2.6. At
the other end of the spectrum of categories through which the Giry monad factors is the
Eilenberg-Moore category Meas^T, consisting of the Eilenberg-Moore algebras of the Giry
monad. From the theory of monads (see [2], for example), we know that 𝒫 then embeds
into Meas^T, which has additional structure that is useful for dealing with other aspects
of probability theory and decision making which 𝒫 is not equipped for. Since Bayesian
probability requires the existence of regular conditional probabilities, we consider the Giry
monad restricted to the full subcategory of Meas defined by those measurable spaces with
countably generated σ-algebras. Let us briefly recall the definition of a T-algebra.

If (T, η, μ) is a monad in a category C, a T-algebra (X, α) is a pair consisting of an
object X in C and a C arrow α: TX → X such that the diagrams

[Diagram (40): the square with μ_X: T²X → TX, Tα: T²X → TX and α: TX → X on the two remaining sides, and the triangle with η_X: X → TX, α: TX → X and Id_X: X → X]    (40)

commute, i.e., α ∘ Tα = α ∘ μ_X and α ∘ η_X = Id_X; the first diagram is called the
associative law and the second diagram the unit law. A morphism of T-algebras
(X, α) → (Y, β) is an arrow f: X → Y of C such that the diagram

[Diagram (41): the square with Tf: TX → TY, α: TX → X, β: TY → Y and f: X → Y]    (41)

commutes, i.e., f ∘ α = β ∘ Tf.

When T is the Giry monad, an algebra α: TX → X consists of a measurable space X
with a countably generated σ-algebra, the space TX = 𝓟X of probability measures on
X, and a measurable map α satisfying the two defining properties of a T-algebra. The
measurable map Tf in the definition of a morphism of T-algebras is the pushforward
map: given P ∈ TX the pushforward by f is Tf(P) = f_*P ∈ 𝓟Y.

The T-algebras (X, α) are often called decision rules since the measurable map α
assigns (decides) a value in X to each probability measure P on X. Alternatively, we
can think of a decision rule as collapsing a probability distribution to a definite value,
or derandomizing a probability distribution as in [5]. For this reason, we often use the
descriptive characterization of Cencov [20] and call the category Meas^T the category of
decision rules.³

Embedding the main consequence of the existence of regular conditional probabilities
for Bayesian probability into Meas^T, we have the following.

Theorem 5.1. Given a measurable function S: H → TD, there exists a measurable
function I: D → TH such that Ĩ = μ_H ∘ TI is a retraction of S̃ = μ_D ∘ TS in Meas.



³ Cencov did not work in Meas^T, but rather in 𝒫, restricting to measurable spaces X such that TX
has a σ-algebra generated by finitely many atoms. The primary difference between the current approach
and that of Cencov is that we take a Bayesian viewpoint, while he is attempting to describe the standard
statistical inference perspective.
Proof. This follows immediately from Theorem 4.1 and the embedding of 𝒫 into Meas^T
and can be summarized in the diagram

[Diagram (42): the arrows between TH and TD induced by S and I]    (42)

□

One of the primary advantages to using the existence of regular conditional
probabilities guaranteed by Theorem 3.2 is that we only require that measurable spaces have
countably generated σ-algebras and that the measures are perfect. In contrast, much
of the previous work involving the Giry monad and regular conditional probabilities
requires resorting to topological arguments and restricting to Polish spaces. For example,
in [6], Doberkat characterizes the T-algebras for the Giry monad under the Polish space
assumption, and proves the counter-intuitive result that there are no non-trivial decision
rules for finite spaces in Meas^T. In contrast, we exhibit a finite space having a T-algebra
when one does not require topological restrictions.

Example 5.2. An important such case is the decision rule d: T2 → 2 given by

d(P) = \begin{cases} \top & \text{if } P(\{\top\}) = 1 \\ \bot & \text{if } P(\{\top\}) < 1. \end{cases}    (43)

The function d is measurable since d^{-1}({⊤}) = {δ_⊤} ∈ Σ_{T2}. The associativity identity

[Diagram (44): the square with Td: T²2 → T2, μ_2: T²2 → T2 and d: T2 → 2 on the two remaining sides; on elements, Q ↦ Td(Q) = Qd^{-1} ↦ d(Qd^{-1}) along one route and Q ↦ μ_2(Q) ↦ d(μ_2(Q)) along the other]    (44)

where μ is the monad multiplication defined by μ_2(Q)(A) = ∫_{q ∈ T2} ev_A(q) dQ, is satisfied
since both routes map each element Q ≠ δ_{δ_⊤} of T²2 to ⊥, while δ_{δ_⊤} ↦ ⊤. The unit law
Id_2 = d ∘ η_2 is trivial to verify.

The decision rule d: T2 → 2 partitions the space T2 into δ_⊤ and all measures on 2
whose value on {⊤} is less than one. There are many other decision rules
for 2, and any other finite or nonfinite space. Characterizing decision rules without the
requirement for continuity is an open problem.
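
The two algebra laws for d can be spot-checked numerically. The sketch below is a hedged
illustration rather than a proof: it assumes that a measure P in T2 is encoded by the single
number θ = P({⊤}), that elements Q of T²2 are restricted to finitely supported measures
written as a dict from θ to weight, and that the names TOP, BOT, mu, t_d, eta and
associativity_holds are hypothetical helpers introduced here.

```python
from fractions import Fraction as F

TOP, BOT = "top", "bot"

def d(theta):
    """Decision rule d: T2 -> 2, with P in T2 encoded by theta = P({TOP})."""
    return TOP if theta == 1 else BOT

def mu(q):
    """Monad multiplication on a finitely supported Q: mu(Q)({TOP}) = sum of w * theta."""
    return sum(w * theta for theta, w in q.items())

def t_d(q):
    """Pushforward Td(Q) = Q d^{-1}, encoded by the mass it gives to TOP."""
    return sum(w for theta, w in q.items() if d(theta) == TOP)

def eta(x):
    """Unit of the monad: the Dirac measure at x, encoded by its mass on TOP."""
    return F(1) if x == TOP else F(0)

def associativity_holds(q):
    """Compare the two routes T^2 2 -> 2 of the associativity square at Q."""
    return d(mu(q)) == d(t_d(q))

q_point = {F(1): F(1)}                       # Q = Dirac measure at delta_TOP
q_mixed = {F(1): F(1, 2), F(1, 2): F(1, 2)}  # half the mass away from delta_TOP
q_zero = {F(0): F(1)}                        # Q = Dirac measure at delta_BOT
```

Only q_point is sent to TOP by either route, in line with the argument of Example 5.2, and
d(eta(x)) returns x for both points of 2.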